[WIP] Feature Analytics: Add Data Analyzer for pre-training graph data analysis by svij-sc · Pull Request #591 · Snapchat/GiGL

svij-sc · 2026-04-17T23:51:18Z

Summary

Standalone DataAnalyzer module that takes a YAML config pointing at BQ node/edge tables and generates a single self-contained HTML report covering data quality, feature distributions, and graph structure — so engineers can diagnose training data issues in minutes instead of after a failed training run.
4-tier validation: hard fails (dangling edges, referential integrity, duplicate nodes) → core metrics (degree distribution, hubs, cold-start, memory budget, neighbor explosion estimate) → label/heterogeneous (class imbalance, label coverage, edge type distribution) → opt-in advanced (reciprocity, homophily, connected components, clustering).
Thresholds and check selection backed by a literature review of 18 production GNN papers (PinSage, LiGNN, TwHIN, GiGL, BLADE, AliGraph, GraphSMOTE, Beyond Homophily, Feature Propagation, and more). Each threshold cites its source paper.
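
As an illustration of the Tier 1 dangling-edge check, here is a minimal sketch. It is not the PR's SQL template: the table and column names (nodes, edges, node_id, src, dst) and the helper names are hypothetical, and sqlite3 stands in for BigQuery so the snippet runs locally.

```python
import sqlite3

# Hypothetical template; the real templates live in queries.py and target BigQuery.
# Table names are interpolated, so they must come from trusted config.
DANGLING_EDGES_SQL = """
SELECT COUNT(*)
FROM {edge_table} AS e
WHERE NOT EXISTS (SELECT 1 FROM {node_table} AS n WHERE n.node_id = e.src)
   OR NOT EXISTS (SELECT 1 FROM {node_table} AS n WHERE n.node_id = e.dst)
"""


def count_dangling_edges(
    conn: sqlite3.Connection, node_table: str, edge_table: str
) -> int:
    """Count edges whose source or destination is missing from the node table."""
    query = DANGLING_EDGES_SQL.format(node_table=node_table, edge_table=edge_table)
    (count,) = conn.execute(query).fetchone()
    return count


def build_demo_connection() -> sqlite3.Connection:
    """Create a tiny in-memory graph with exactly one dangling edge."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (node_id INTEGER)")
    conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
    conn.executemany("INSERT INTO nodes VALUES (?)", [(1,), (2,), (3,)])
    # Edge (3, 4) is dangling because node 4 does not exist.
    conn.executemany("INSERT INTO edges VALUES (?, ?)", [(1, 2), (2, 3), (3, 4)])
    return conn
```

A non-zero count from a check like this is what would be treated as a hard fail before any Tier 2 metric runs.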

Changes

gigl/analytics/data_analyzer/ — config.py, types.py, queries.py (18 SQL templates), graph_structure_analyzer.py, feature_profiler.py (stub), data_analyzer.py orchestrator + CLI
gigl/analytics/data_analyzer/report/ — PRD.md, SPEC.md, report_generator.py, and AI-owned report.ai.html, charts.ai.js, styles.ai.css (regenerable from PRD + SPEC)
tests/unit/analytics/data_analyzer/ — 26 unit tests covering config parsing, SQL templates, analyzer orchestration, and HTML snapshot
tests/test_assets/analytics/ — sample_analyzer_config.yaml + golden_report.html snapshot
docs/plans/ — design doc, literature review, 1-pager, engineering spec (all colocated)
pyproject.toml — package-data declaration so .ai.* assets ship in installed wheels

Test plan

uv run python -m unittest discover -s tests/unit/analytics -p "*_test.py" -t . → 26/26 pass
make type_check → clean on 651 files
make check_format → clean
Manual: run analyzer CLI against a real BQ dataset and inspect the generated HTML

v1 scope cuts (follow-up PRs)

FeatureProfiler: TFDV/Dataflow integration is a working stub that logs a warning and returns empty results. The full Beam pipeline wiring (reusing GenerateAndVisualizeStats, IngestRawFeatures, init_beam_pipeline_options from the existing DataPreprocessor) will land in a follow-up PR.
GCS upload: The orchestrator generates the HTML but does not yet upload it; currently returns the target path with a TODO.
Tier 4 advanced queries: Reciprocity, homophily, connected components, and clustering coefficient are not implemented. Power-law exponent is computed as a degree-stats approximation.

Docs

Design doc: docs/plans/20260415-bq-data-analyzer.md
Literature review: docs/plans/20260415-bq-data-analyzer-references.md
1-pager: docs/plans/20260416-data-analyzer-1-pager.md
Engineering spec: docs/plans/20260416-data-analyzer-engineering-spec.md
Report PRD (product intent): gigl/analytics/data_analyzer/report/PRD.md
Report SPEC (technical contract): gigl/analytics/data_analyzer/report/SPEC.md

Implements the orchestration layer for BQ-based graph data quality checks:
- Tier 1 hard-fails (dangling edges, referential integrity, duplicate nodes) raise DataQualityError carrying a partially populated result.
- Tier 2 core metrics (counts, degree stats, top-K hubs, INT16 clamp, NULL rates) plus Python-side feature memory and neighbor-explosion estimates.
- Tier 3 label/heterogeneous checks auto-enabled by config (label_column presence; multiple edge tables).
- Tier 4 opt-in placeholders (power-law exponent from degree stats).

Co-Authored-By: shubhamvij <svij@snapchat.com>
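
The tiered flow in this commit message can be sketched roughly as follows. This is a simplified illustration, not the code in graph_structure_analyzer.py: the result fields, the analyze() signature, and both estimate formulas (rows × features × bytes per value for memory, and the sum 1 + f1 + f1·f2 + ... for neighbor explosion) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class GraphAnalysisResult:
    """Hypothetical, trimmed-down stand-in for the result type in types.py."""

    dangling_edge_count: Optional[int] = None
    duplicate_node_count: Optional[int] = None
    feature_memory_bytes: Optional[int] = None
    neighbor_explosion_estimate: Optional[int] = None


class DataQualityError(Exception):
    """Tier 1 hard fail that keeps whatever was computed before the failure."""

    def __init__(self, message: str, partial_result: GraphAnalysisResult) -> None:
        super().__init__(message)
        self.partial_result = partial_result


def estimate_neighbor_explosion(fan_out: Sequence[int]) -> int:
    """Assumed upper bound on nodes sampled per seed: 1 + f1 + f1*f2 + ..."""
    total, frontier = 1, 1
    for fan in fan_out:
        frontier *= fan
        total += frontier
    return total


def analyze(
    dangling_edges: int,
    duplicate_nodes: int,
    num_nodes: int,
    num_features: int,
    fan_out: Sequence[int],
    bytes_per_value: int = 4,  # assumes float32 features
) -> GraphAnalysisResult:
    result = GraphAnalysisResult(
        dangling_edge_count=dangling_edges,
        duplicate_node_count=duplicate_nodes,
    )
    # Tier 1: stop early, but hand back the partially populated result.
    if dangling_edges > 0 or duplicate_nodes > 0:
        raise DataQualityError(
            f"hard fail: {dangling_edges} dangling edges, "
            f"{duplicate_nodes} duplicate nodes",
            partial_result=result,
        )
    # Tier 2: Python-side estimates computed from already-fetched counts.
    result.feature_memory_bytes = num_nodes * num_features * bytes_per_value
    result.neighbor_explosion_estimate = estimate_neighbor_explosion(fan_out)
    return result
```

Attaching the partial result to the exception lets a caller still render a report that shows which hard fail fired, rather than losing everything computed so far.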

Implements the report_generator module that stitches AI-owned template, styles, and chart JS into a single self-contained HTML report by replacing the four INJECT_* placeholders. Adds a golden-file snapshot test (and four structural tests) so future AI-driven edits to the report assets fail fast until the snapshot is regenerated. Registers the *.ai.{html,js,css} assets as package-data so importlib.resources can resolve them from an installed wheel.

Co-Authored-By: shubhamvij <svij@snapchat.com>
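
A rough sketch of the placeholder-stitching idea follows. The commit message only says there are four INJECT_* placeholders; the four names used here, the template, and render_report() are all hypothetical, and the real module presumably loads its assets through importlib.resources rather than taking strings.

```python
import json
from html import escape
from typing import Any, Dict

# Hypothetical inline template; the real one is report.ai.html.
TEMPLATE = (
    "<html><head><title>INJECT_TITLE</title>"
    "<style>INJECT_STYLES</style></head>"
    "<body><script>const REPORT_DATA = INJECT_DATA;</script>"
    "<script>INJECT_CHARTS_JS</script></body></html>"
)


def render_report(
    template: str, styles: str, charts_js: str, data: Dict[str, Any], title: str
) -> str:
    """Return one self-contained HTML string with every placeholder filled."""
    replacements = {
        "INJECT_TITLE": escape(title),
        "INJECT_STYLES": styles,
        "INJECT_CHARTS_JS": charts_js,
        "INJECT_DATA": json.dumps(data),
    }
    # Fail fast if the template lost a placeholder, e.g. after a regeneration.
    missing = [name for name in replacements if name not in template]
    if missing:
        raise ValueError(f"template is missing placeholders: {missing}")
    rendered = template
    for name, value in replacements.items():
        rendered = rendered.replace(name, value)
    return rendered
```

Checking for missing placeholders up front serves the same purpose as the structural tests mentioned above: a regenerated template that drops a marker fails loudly instead of producing a silently broken report.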

Implements the main orchestrator class that coordinates graph structure analysis, feature profiling, and HTML report generation. Includes CLI entry point with argparse for analyzer_config_uri and resource_config_uri.

Co-Authored-By: shubhamvij <svij@snapchat.com>
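
For orientation, an argparse entry point taking these two arguments might look like the sketch below. The flag spelling (--analyzer_config_uri, --resource_config_uri), the choice to make both required, and the help text are assumptions; only the two argument names come from the commit message.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the two config URIs; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="Analyze BQ node/edge tables and emit an HTML report."
    )
    parser.add_argument(
        "--analyzer_config_uri",
        required=True,  # assumption: the analyzer cannot run without it
        help="Local path or URI of the analyzer YAML config.",
    )
    parser.add_argument(
        "--resource_config_uri",
        required=True,  # assumption: may be optional in the real CLI
        help="Local path or URI of the resource config.",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"analyzer config: {args.analyzer_config_uri}")
    print(f"resource config: {args.resource_config_uri}")
```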

Sits alongside SPEC.md to separate product requirements (why and what) from technical implementation contract (how). Both are AI-owned and together form the input for regenerating report.ai.html, charts.ai.js, and styles.ai.css.

Co-Authored-By: shubhamvij <svij@snapchat.com>

… 1-pager, engineering spec

Colocates all planning docs for the BQ Data Analyzer feature:
- 20260415-bq-data-analyzer.md: full design doc with 4-tier validation, cost control, tradeoff analysis
- 20260415-bq-data-analyzer-references.md: literature review of 18 production GNN papers with 100+ findings, common themes, and consolidated threshold table
- 20260416-data-analyzer-1-pager.md: executive summary for peer engineers and decision makers
- 20260416-data-analyzer-engineering-spec.md: per-layer implementation plan that the analyzer code in this branch follows

Co-Authored-By: shubhamvij <svij@snapchat.com>

…trator

Previously the orchestrator generated the HTML in memory but left the upload as a TODO, forcing practitioners to copy a Python snippet to see the output. Now DataAnalyzer.run() writes report.html under config.output_gcs_path, detecting the scheme:
- gs:// URIs upload via GcsUtils.upload_from_string()
- local paths write via pathlib, creating parent dirs as needed

Returns the final path (GCS URI or resolved local path) so the CLI can log it and practitioners can open the file directly. Tests cover both local and mocked-GCS paths plus trailing-slash handling.

Co-Authored-By: shubhamvij <svij@snapchat.com>
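
The scheme detection described in this commit could be sketched as below. This is an illustration under assumptions, not DataAnalyzer.run() itself: write_report() is a hypothetical helper, and the upload is passed in as a plain callable because the exact signature of GcsUtils.upload_from_string() is not shown in the PR text.

```python
from pathlib import Path
from typing import Callable


def write_report(
    html: str,
    output_path: str,
    upload_to_gcs: Callable[[str, str], None],
) -> str:
    """Write report.html under output_path and return its final location.

    upload_to_gcs(target_uri, content) is a stand-in for the real GCS client.
    """
    if output_path.startswith("gs://"):
        # Strip any trailing slash so the result never contains "//report.html".
        target_uri = output_path.rstrip("/") + "/report.html"
        upload_to_gcs(target_uri, html)
        return target_uri
    local_target = Path(output_path) / "report.html"
    # Create missing parent directories before writing.
    local_target.parent.mkdir(parents=True, exist_ok=True)
    local_target.write_text(html, encoding="utf-8")
    return str(local_target.resolve())
```

Injecting the uploader keeps the function testable without network access, which mirrors the mocked-GCS tests mentioned in the commit message.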

Quickstart-first guide at gigl/analytics/README.md covering:
- 3-step quickstart (auth, YAML config, CLI command) with a single entry point that now writes report.html to disk or GCS
- Tier summary table (what runs when)
- Interpretation table with thresholds + "what to do" actions drawn from the 18-paper literature review
- Advanced config keys (opt-in Tier 3/4, label_column, timestamp_column, fan_out)
- Python API snippet for programmatic access
- graph_validation sub-package pointer
- Scope and limitations (FeatureProfiler stub, Tier 4 queries TODO)
- Links to design doc, literature review, 1-pager, engineering spec, report PRD, and report SPEC

Co-Authored-By: shubhamvij <svij@snapchat.com>

Changes from the review pass:

README fixes:
- Remove all docs/plans/* links (the plans were intentionally deleted in d3f1eb8). Inline the relevant paper citations into the threshold table so readers aren't pointed at 404s.
- Add "Prerequisites" line pointing at the GiGL installation guide so the quickstart doesn't assume uv/deps are already set up.
- Mark Tier 4 flags (compute_homophily, compute_connected_components, compute_clustering, timestamp_column) as not-yet-implemented in both the tier table and the Advanced Config section, not only in the Scope section at the bottom.
- Add the power-law exponent mention to the Tier 4 row (was only in scope notes; it's actually computed today).
- Document the heterogeneous-graph referential-integrity caveat (analyzer currently joins each edge table against node_tables[0]).
- Link to tests/test_assets/analytics/golden_report.html so a reader can preview the output before authenticating to BQ.

Config fix:
- NodeTableSpec.feature_columns: MISSING -> field(default_factory=list) so that nodes with no features are legal. Previously users got a cryptic OmegaConf MissingMandatoryValue error, and no-feature nodes are a real use case.
- Add a regression test covering the no-feature-columns case.

All 31 analytics unit tests pass. mypy clean. check_format clean.

Co-Authored-By: shubhamvij <svij@snapchat.com>
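
The config fix can be illustrated with plain dataclasses. This is a reduced sketch: bq_table and node_id_column are hypothetical field names, and the real NodeTableSpec is loaded through OmegaConf, where a MISSING default makes the key mandatory.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeTableSpec:
    """Hypothetical, reduced version of the config dataclass in config.py."""

    bq_table: str  # hypothetical field name
    node_id_column: str  # hypothetical field name
    # default_factory gives each instance its own empty list, so a node table
    # with no feature columns is a legal config rather than an error.
    feature_columns: List[str] = field(default_factory=list)


def has_features(spec: NodeTableSpec) -> bool:
    """Return True when the node table declares at least one feature column."""
    return len(spec.feature_columns) > 0
```

Using default_factory rather than a shared list literal also avoids the usual mutable-default pitfall: appending to one spec's feature_columns does not leak into another.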

svij-sc and others added 14 commits April 17, 2026 20:25

feat(analytics): scaffold data_analyzer package structure

c079b9f

Co-Authored-By: shubhamvij <svij@snapchat.com>

feat(analytics): add DataAnalyzerConfig with YAML loading and tests

3988493

Co-Authored-By: shubhamvij <shubhamvij@users.noreply.github.com>

fix(analytics): remove unused imports in config_test.py

cf69b38

Co-Authored-By: shubhamvij <svij@snapchat.com>

feat(analytics): add result type dataclasses (DegreeStats, GraphAnalysisResult, FeatureProfileResult)

8abae4a

Co-Authored-By: shubhamvij <svij@snapchat.com>

feat(analytics): add 18 SQL query templates for graph structure analysis

f1c7f52

Co-Authored-By: shubhamvij <svij@snapchat.com>

style(analytics): apply black formatter to test files

793190c

Co-Authored-By: shubhamvij <svij@snapchat.com>

feat(analytics): add report SPEC.md and initial AI-owned HTML/JS/CSS assets

0b01b5c

Co-Authored-By: shubhamvij <svij@snapchat.com>

feat(analytics): add FeatureProfiler stub (TFDV/Dataflow integration deferred)

42f8d78

Co-Authored-By: shubhamvij <svij@snapchat.com>

fix(analytics): cast OmegaConf.to_object result in config_test

56eb170

Narrows the Union return type for mypy in the direct-merge test path.

Co-Authored-By: shubhamvij <svij@snapchat.com>

style(analytics): apply isort and mdformat to data_analyzer files

7f387f6

Co-Authored-By: shubhamvij <svij@snapchat.com>

svij-sc requested review from kmontemayor2-sc, mkolodner-sc, nshah-sc, xgao4-sc, yliu2-sc and zfan3-sc as code owners April 17, 2026 23:51

svij-sc added 2 commits April 17, 2026 23:51

delete plans

d3f1eb8

svij-sc changed the title ~~feat(analytics): add BQ Data Analyzer for pre-training graph data analysis~~ Feature Analytics: Add Data Analyzer for pre-training graph data analysis Apr 18, 2026

svij-sc changed the title ~~Feature Analytics: Add Data Analyzer for pre-training graph data analysis~~ [WIP] Feature Analytics: Add Data Analyzer for pre-training graph data analysis Apr 18, 2026

svij-sc added 4 commits April 20, 2026 18:01

tfdv

826c893

svij-sc wants to merge 20 commits into main from svij/easy-analyz-bq
